visual explanation
VanillaNet: the Power of Minimalism in Deep Learning (Supplementary Material)
The detailed architecture for VanillaNet with 7-13 layers can be found in Table 1, where each convolutional layer is followed by an activation function. For VanillaNet-13-1.5, the number of channels is multiplied by 1.5. For classification on ImageNet, we train the VanillaNets for 300 epochs using cosine learning rate decay [5]. The λ is linearly decayed from 1 to 0 between epoch 0 and epoch 100. The training details can be found in Table 2.
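The two schedules mentioned above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names, the `base_lr` and `min_lr` parameters, and the choice to hold λ at 0 after epoch 100 are assumptions made for the example, and any warmup phase is omitted.

```python
import math


def lambda_at(epoch, decay_epochs=100):
    """Linear decay of lambda from 1 at epoch 0 to 0 at `decay_epochs`.

    Holding the value at 0 afterwards is an assumption of this sketch.
    """
    return max(0.0, 1.0 - epoch / decay_epochs)


def cosine_lr(epoch, total_epochs=300, base_lr=1.0, min_lr=0.0):
    """Cosine learning rate decay from `base_lr` down to `min_lr`.

    `base_lr` and `min_lr` are placeholder values, not taken from Table 2.
    """
    progress = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 50, 100, 150, 300):
        print(epoch, lambda_at(epoch), cosine_lr(epoch))
```

With these defaults, λ reaches 0 one third of the way through the 300-epoch run, while the learning rate continues to decay until the final epoch.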
FFAM: Feature Factorization Activation Map for Explanation of 3D Detectors
LiDAR-based 3D object detection has made impressive progress recently, yet most existing models are black-box, lacking interpretability. Previous explanation approaches primarily focus on analyzing image-based models and are not readily applicable to LiDAR-based 3D detectors. In this paper, we propose a feature factorization activation map (FFAM) to generate high-quality visual explanations for 3D detectors. FFAM employs non-negative matrix factorization to generate concept activation maps and subsequently aggregates these maps to obtain a global visual explanation. To achieve object-specific visual explanations, we refine the global visual explanation using the feature gradient of a target object. Additionally, we introduce a voxel upsampling strategy to align the scale between the activation map and input point cloud. We qualitatively and quantitatively analyze FFAM with multiple detectors on several datasets. Experimental results validate the high-quality visual explanations produced by FFAM.
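The factorization step described in the abstract can be illustrated with a small sketch. It assumes a non-negative feature matrix `F` of shape (voxels, channels), uses a plain multiplicative-update NMF written in NumPy, and aggregates the concept maps by summation followed by min-max normalization. The aggregation rule, the rank, and all function names are assumptions for illustration and are not taken from the paper; the gradient-based refinement and voxel upsampling steps are not shown.

```python
import numpy as np


def nmf(F, rank, iters=200, eps=1e-9, seed=0):
    """Factorize non-negative F (n x c) as W (n x rank) @ H (rank x c).

    Uses standard multiplicative updates for the Frobenius-norm objective.
    """
    rng = np.random.default_rng(seed)
    n, c = F.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, c)) + eps
    for _ in range(iters):
        H *= (W.T @ F) / (W.T @ W @ H + eps)
        W *= (F @ H.T) / (W @ H @ H.T + eps)
    return W, H


def global_activation_map(F, rank=4):
    """Sum the per-concept activation maps (columns of W) into one map.

    Summation and min-max normalization are assumptions of this sketch.
    """
    W, _ = nmf(F, rank)
    g = W.sum(axis=1)
    return (g - g.min()) / (g.max() - g.min() + 1e-12)


if __name__ == "__main__":
    # Random stand-in for per-voxel features (hypothetical data).
    features = np.random.default_rng(1).random((50, 8))
    print(global_activation_map(features).shape)
```

Each column of `W` plays the role of one concept activation map over the voxels, so the returned vector assigns a single score in [0, 1] to every voxel.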
b6f8dc086b2d60c5856e4ff517060392-Supplemental.pdf
In EXPAND, we augment each human-evaluated state to 5 states. To verify that 5 is sufficient, we also experimented with the number of augmentations required in each state to get the best performance. AGIL [50] was designed to utilize saliency maps collected via human gaze. The network architectures are shown in Figure 1. Hence, we view the output of the attention network as the prediction of whether a pixel should be included in a human-annotated bounding box.